Introduction to Open Data Science - Course Project

About the project

This is the project page for the Introduction to Open Data Science course of fall 2022. A link to the repository for this project page is here: IODS-project (github.com/teemuvh/IODS-project).

During this course, we get acquainted with R and certain data science methods, such as linear and logistic regression, clustering and classification algorithms, and dimensionality reduction models. During this course, we use the textbooks (1) Vehkalahti, Kimmo & Everitt, Brian S. (2019). Multivariate Analysis for the Behavioral Sciences and (2) Ewen Harrison and Riinu Pius (2021). R for Health Data Science.

Week 1: Start me up!

During the first week, the focus is on getting everything up and running, and getting familiar with the syntax of R before moving to the more challenging topics of the course.

I use Python in my daily work, but so far I like how simple it is to read data into R.

# Here, we could load a csv-file to R as follows:
# data <- read_csv("some_data.csv")
# And view it as follows:
# View(data)
# But since we do not have "some_data" at our disposal,
# I have just added this as a comment.

# However, we can print the date:
date()
## [1] "Thu Dec  1 10:18:51 2022"

The Harrison & Pius book seems to work well for revising R syntax as well as for learning some new tricks for data transformations (mutate(), summarise(), pivot_wider(), factor(), etc.). The difficult part will be remembering when to utilise these functions…

In the Vehkalahti & Everitt textbook chapters 1 and 2, I liked how the same data were presented in different visualisation formats, which made me reflect a lot about how I could have presented some of my data before. I will attempt to use this information in the future to actually focus a bit more on what could be an optimal visualisation method for some given data that I work with. Additionally, it was fun to get to draw different visualisations with R, and again it seems simpler compared to Python. In terms of how the material works as a “crash course” to R, I think that having a background in programming helps in getting started with the material, since it makes the programming part a bit easier. I believe the material could work if one comes with no background in programming, but the learning curve is a bit steeper in the beginning.

It is nice to get to refresh my memory of the statistical methods as well. I took an introductory course in statistics many years ago, so I have to recap/re-learn many of the topics. Thus, my favourite part of this week has been actually refreshing my memory on these statistical topics, and I expect to learn a lot especially in this field. I believe this information will be useful for my own research in the future.


Regression and model validation

Jump straight to the assignments.

date()
## [1] "Thu Dec  1 10:18:51 2022"

The core idea of regression analysis is to use statistical methods to assess the way in which a change in circumstances affects variation in an observed random variable. This section summarises my notes for the chapters on simple linear regression and multiple linear regression.

There are four basic assumptions in linear regression:

  1. The relationship between predictors and outcome is linear
  2. The residuals are independent
  3. The residuals are normally distributed
  4. The residuals have equal variance
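These assumptions are commonly checked with R's built-in diagnostic plots. A minimal sketch, using the built-in mtcars data as a stand-in (not course data):

```r
# Fit a simple model on built-in data and draw the standard diagnostics
fit <- lm(mpg ~ wt, data = mtcars)
par(mfrow = c(2, 2))
plot(fit)  # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
```

The four panels roughly map to assumptions 1–4: curvature in Residuals vs Fitted suggests non-linearity, the Q-Q plot checks normality, and Scale-Location checks equal variance.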

Simple linear regression

A simple linear regression model (the intercept \(\beta_0\) is not always needed) has only a single explanatory variable (\(x\)) and a single dependent variable (\(y\)):

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

where \(\beta_0\) is the intercept, \(\beta_1\) is the slope of the linear relationship between the dependent variable and the explanatory variable, and \(\epsilon\) is an error term that measures the difference between the observed value and the prediction from the model.

The prediction of the model is presented without the error term as:

\[ \hat{y}_i = \beta_0 + \beta_1 x_i \]

This equation can then be used for predicting the value of a dependent variable for some explanatory variable.
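As a sketch of this workflow (simulated data with hypothetical coefficients, not course data), fitting the model and predicting for a new observation looks like:

```r
# Simulate data from a known linear model, then recover the coefficients
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50)                 # true beta_0 = 2, beta_1 = 3
fit <- lm(y ~ x)
coef(fit)                                  # estimates should be close to 2 and 3
predict(fit, newdata = data.frame(x = 5))  # y-hat for a new explanatory value
```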

The variability of the dependent variable can be decomposed into the Regression Mean Square (RGMS) and the Residual Mean Square (RMS):

\[ RGMS = \sum_{i=1}^{n} (\hat{y_i} - \overline{y})^2 \] \[ RMS = \sum_{i=1}^{n} \frac{(y_i - \hat{y_i})^2}{(n-2)} \]
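These quantities can be checked against R's anova() table; a sketch on the built-in mtcars data (for a single-predictor model the regression sum of squares equals its mean square, since its degrees of freedom is 1):

```r
fit <- lm(mpg ~ wt, data = mtcars)
y <- mtcars$mpg
y_hat <- fitted(fit)
reg_ss <- sum((y_hat - mean(y))^2)                 # regression sum of squares
res_ms <- sum((y - y_hat)^2) / (nrow(mtcars) - 2)  # residual mean square
a <- anova(fit)
all.equal(reg_ss, a[["Sum Sq"]][1])   # TRUE
all.equal(res_ms, a[["Mean Sq"]][2])  # TRUE
```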

Regression diagnostics

We can assess our model by calculating the difference between an observed value and our prediction:

\[ \hat{\epsilon} = y - \hat{y} \]

R prints another measure for assessing how close the data are to the fitted line, R-squared:

\[ R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y_i})^2}{\sum_{i=1}^{n}(y_i - \overline{y})^2} \] where \(\overline{y}\) is the mean value of the sample. R-squared represents the proportion of the variance of the dependent variable that is explained by the explanatory variable. If R-squared is 0.0, the explanatory variable explains none of the variation in the dependent variable; 1.0 indicates that all of the variability is explained by the model (i.e., the regression line fits perfectly). The adjusted R-squared includes a penalty term that lowers the value for less important explanatory variables.
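This definition can be verified numerically against summary()'s output; a quick sketch with the built-in mtcars data:

```r
# Compute R-squared by hand and compare with what summary.lm reports
fit <- lm(mpg ~ wt, data = mtcars)
y <- mtcars$mpg
r2_manual <- 1 - sum((y - fitted(fit))^2) / sum((y - mean(y))^2)
all.equal(r2_manual, summary(fit)$r.squared)  # TRUE
```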

And we should also use visualisation to assess the model; useful plots include:

  • Boxplot (or a Q-Q plot) of the residuals
  • Residuals against the corresponding values of the explanatory variables
  • Residuals against the fitted values of the response variable

Based on these plots, we can decide whether a linear model is appropriate or whether we should use a non-linear model.

Multiple linear regression

Here, we have more explanatory variables than in simple regression (e.g., analysing change in blood pressure based on coffee consumption in smokers and non-smokers). Basically, we have multiple parameters that affect the prediction.

\[ \hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \]

Confounding is a concept referring to a situation where another explanatory variable distorts the relationship between an explanatory variable and the outcome (e.g., smoking could be related to coffee consumption and to blood pressure).
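The idea can be demonstrated with simulated data: below, a hypothetical confounder z drives both the exposure x and the outcome y, so a model without z wrongly attributes an effect to x.

```r
set.seed(42)
z <- rnorm(1000)            # confounder (e.g., smoking)
x <- z + rnorm(1000)        # exposure (e.g., coffee), partly driven by z
y <- 2 * z + rnorm(1000)    # outcome depends on z only, not on x
coef(lm(y ~ x))[["x"]]      # clearly non-zero: a confounded estimate
coef(lm(y ~ x + z))[["x"]]  # close to zero once z is adjusted for
```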

Assignments

1. Data wrangling

Done; the wrangled data is in my GitHub repo.

2. Regression analysis

Task 1:

First, we need to read the data we created in the data wrangling exercise. The data is stored in the data directory. Then we quickly check the structure and the dimensions of the data.

learning2014 <- read.table("data/learning2014.csv", sep=",", header=T)

dim(learning2014)
## [1] 166   7
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

dim() outputs the dimensionality of the data ([166, 7]), whereas str() outputs, in addition to the number of observations and variables (i.e., the dimensions), a little more information about the data. str() also outputs a few example observations for each of the seven variables.

The variables in the data are gender, age, attitude, deep, stra, surf, and points. In general, this data is about students’ attitudes towards studying statistics and how they feel about learning the topics on a statistics course. The variables indicate a student’s gender, age, and attitude towards statistics (averaged over 10 answers to questions on a Likert scale). From the link given in the material, I interpret that deep, strategic (stra), and surface (surf) indicate learning approaches based on clusters of questions related to these approaches. Points indicates the points a student got from the course exam. In the dataset considered in this exercise, we have these variables for 166 students.

Task 2:

# access the GGally and ggplot2 libraries
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# create a plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

I summarised the data into one scatterplot matrix. From the visualisation above, we can observe the different dimensions and how they interact. Scatterplots are drawn on the left side of the diagonal, the diagonal depicts the variable distributions, and the right side of the matrix displays the Pearson correlations.

First, we see that the sample has quite a lot more female students than male. Most of the students in both groups were around 20-25 years old. Male students seem to have had a slightly more positive attitude towards statistics, having most of the mass in the middle of the Likert scale (around 3.5), but also more answers in the higher end of the scale compared to female students. Male students seem to have had a somewhat stronger tendency towards the deep learning strategy, whereas female students seem to have been more prone to the strategic dimension. Female students also seem to have been more prone to the surface dimension. Points in the exam seem quite similarly distributed between the groups. However, it seems that more male students acquired the highest points.

Most of the variables seem not to be correlated based on the Pearson correlation coefficient. However, a few of them are. Quite understandably, attitude seems to correlate with the points acquired.

Task 3:

Now, we look at multivariable regression. Let the first explanatory variable be attitude, the second stra, and the third surf. These variables were chosen based on their correlation with points, as indicated by the Pearson correlation coefficients in the visualisation above.

We can first visualise each variable's relationship with the exam scores independently:

library(ggplot2)
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

library(ggplot2)
qplot(stra, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

library(ggplot2)
qplot(surf, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Based on the three visualisations, it seems that both attitude and stra have a positive correlation with the exam results. Surf, unsurprisingly, seems to have a slightly negative correlation. Next, we fit a multivariable regression model to the given explanatory variables.

# create a regression model with multiple explanatory variables
multivariable_model <- lm(points ~ attitude + stra + surf, data = learning2014)

# summary
multivariable_model %>%
  summary()
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The summary includes the residuals at five points (min, first quartile, median, third quartile, and max). If the model fits the data perfectly, the distribution of the residuals should be symmetrical.

Next, there is the coefficients block with values for estimate, Std. error, t-value, and p-value.

The estimate column indicates that a change of one point in attitude is associated with a change of approximately 3.4 points in the exam score. For stra and surf, a one-point change corresponds to approximately +0.85 and -0.59 points, respectively.

The Std. Error indicates the uncertainty of each coefficient estimate, i.e., how much the estimate is expected to vary from the true coefficient value on average.

The t-value indicates the distance of a coefficient estimate from 0, measured in standard errors. This value can be positive or negative, but it should be far from 0. A large distance from 0 indicates a relationship between the given variables (e.g., attitude and points), whereas a value at (or near) 0 means that there is no relationship.

Finally, we have the p-value: the probability of getting a t-value at least as extreme as the one observed, assuming the null hypothesis holds. This value should be as small as possible for us to be able to confidently reject the null hypothesis.

The significance codes are probably in the output for a quicker read of the results.

Lastly, there are the residual standard error (RSE), multiple and adjusted R-squared, and the F-statistic. The RSE measures how well the model fits the data based on the residuals (\(y-\hat{y}\)). In practice, it is the square root of the residual sum of squares divided by the residual degrees of freedom (the number of observations minus the number of estimated coefficients):

\[ RSE = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y_i})^2}{n - p - 1}} \] where \(p\) is the number of explanatory variables (here \(166 - 3 - 1 = 162\) degrees of freedom, matching the output). R-squared (\(R^2\)) is another value to assess how well the model fits the data. In practice, R-squared represents the proportion of the variance of the dependent variable that is explained by the explanatory variables. If R-squared is 0.0, the explanatory variables have no effect on the dependent variable. If the result is 1.0, all of the variability is explained by the model (i.e., the regression line fits perfectly). The adjusted R-squared includes a penalty term that lowers the value for less important explanatory variables.

R-squared is calculated as:

\[ R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y_i})^2}{\sum_{i=1}^{n}(y_i - \overline{y})^2} \] where \(\overline{y}\) is the mean value of the sample.
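The RSE definition can likewise be checked numerically against summary(); a sketch with the built-in mtcars data (one predictor, so \(n-2\) residual degrees of freedom):

```r
# Compute the residual standard error by hand and compare to summary.lm's sigma
fit <- lm(mpg ~ wt, data = mtcars)
rss <- sum(residuals(fit)^2)
rse_manual <- sqrt(rss / (nrow(mtcars) - 2))
all.equal(rse_manual, summary(fit)$sigma)  # TRUE
```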

The F-test compares the fitted model to a model with no explanatory variables and indicates whether at least one variable is statistically significant.

The summary of the model indicates that only attitude has a statistically significant correlation with the exam points, whereas the correlations between stra and points and between surf and points are not statistically significant. R-squared seems low to me. Perhaps this is because we have so few data points to fit the model to?

Next, we remove the non-significant variables and fit the model again.

# create a regression model with multiple explanatory variables
attitude_model <- lm(points ~ attitude, data = learning2014)

# summary
attitude_model %>%
  summary()
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Now, we only have the significant variable in the model.

Task 4:

Most of this task was done in the earlier task, but in short, a change of one point (on the Likert scale) in attitude is associated with a change of approximately 3.5 points in the exam score. R-squared was also explained above.

Task 5:

Finally, draw some visualisations of the residuals. First, I print the regression line of the attitude variable again, just to help in analysis:

library(ggplot2)
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Next, I start outputting the different residual plots, beginning with Residuals vs Fitted.

# plot all plots:
# plot(attitude_model, which = c(1, 2, 5))
# but for the sake of the notebook, I plot these one by one
plot(attitude_model, which=1)

In the plot above, the residuals (y-axis) are plotted against the fitted values (x-axis). The residual line does not correspond exactly to the regression line. I think (but please correct me if I am wrong) what we can read from the line is that the variance is larger at the upper end of the fitted values, meaning that the model might not fit well there. This is likely an issue of the small data size in this case.

plot(attitude_model, which=2)

The Q-Q plot shows that the residuals are not perfectly normally distributed but deviate at both tails. What I read from this is that there might be values at both ends that are far from what would be expected if the residuals were normally distributed.

plot(attitude_model, which=5)

Residuals vs Leverage shows how important different data points are to the regression model. The spread of the data points should not change as a function of leverage, but at the right end of the line it seems to change. Perhaps this is again an issue with the small data size? There are no data points outside of Cook’s distance, so there are no data points whose deletion from the data would have a high influence on the regression model.


Logistic regression

Jump straight to the assignments.

date()
## [1] "Thu Dec  1 10:18:58 2022"

This part of the course focuses on logistic regression, a type of model in which the output is categorical. For example, if we want to predict whether an email is spam or not, or whether a tumour is malignant or benign, we could use a logistic regression model. Similarly, instead of only two, we could have multiple possible labels.

Some notes about Logistic Regression

Linear regression models the population mean of the response variable directly as a linear function of the explanatory variables:

\[ E(y|x_1, x_2, ..., x_n) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n \]

With categorical outcomes, this is not a suitable approach; we need to output something that we can interpret as probabilities for the different outcomes. The issue with linear regression models is that the output can be any real number between \(-\infty\) and \(\infty\). Thus, we need a link function, which here is the logarithm of the odds: \(log(\frac{\pi}{1-\pi})\). Logistic regression then takes the form:

\[ logit (\pi) = log(\frac{\pi}{1-\pi}) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n \]

The logistic regression function rearranged is:

\[ \pi = \frac{ \exp (\beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n)}{1 + \exp (\beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n)} \] The observed variable \(y\) is represented as:

\[ y = \pi(x_1, x_2, ..., x_n) + \epsilon \] where \(\epsilon = 1 - \pi(x_1, x_2, ..., x_n)\), with probability \(\pi(x_1, x_2, ..., x_n)\), if \(y = 1\). Otherwise, if \(y = 0\), \(\epsilon = -\pi(x_1, x_2, ..., x_n)\), with probability \(1 - \pi(x_1, x_2, ..., x_n)\).
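Base R implements the logit link and its inverse as qlogis() and plogis(); a quick sketch showing that the two functions match the formulas above:

```r
p <- 0.8
logit_p <- log(p / (1 - p))    # the logit link: log-odds of p
all.equal(logit_p, qlogis(p))  # TRUE: qlogis is the logit function
all.equal(plogis(logit_p), p)  # TRUE: plogis maps log-odds back to a probability
```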

Assignments

Data wrangling

Data for the following analysis exercise here.

Logistic regression

In this assignment, we will apply logistic regression to the alcohol consumption data created in the previous exercise. The original data is presented in Cortez & Silva (2008).1

library(readr)
alc <- read_csv("data/alc.csv", show_col_types = FALSE)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The wrangled data consists of 370 observations and 33 original variables. In practice, the observations are Portuguese secondary school students attending courses in mathematics and in Portuguese. The variables include student demographics, social and school-related features, and grades. We have added two more variables, alc_use and high_use, where alc_use averages the weekday and weekend alcohol consumption, while high_use is a logical value indicating whether a student’s alcohol consumption is low (FALSE) or high (TRUE).

The goal of this exercise is to attempt to explain the relationship between different explanatory variables and alcohol consumption. As there are so many options, we limit the number of variables to be explored to four. There are many variables that I believe can explain alcohol consumption, but for this experiment, I will choose variables related to free-time activity. My hypothesis is that the more a student has ‘other’ activities (internet, romantic, activities), the less free time they have and the less alcohol they consume.

First, we recode the string-formatted yes/no answers as binary (0/1) factors.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8     ✔ stringr 1.4.1
## ✔ tidyr   1.2.1     ✔ forcats 0.5.2
## ✔ purrr   0.3.5     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(finalfit)

alc <- alc %>%
  mutate(internet.int = factor(internet) %>%         
           fct_recode("0" = "no",
                      "1" = "yes"),

         romantic.int = factor(romantic) %>% 
           fct_recode("0" = "no",
                      "1"  = "yes"),
         
         activities.int = factor(activities) %>% 
           fct_recode("0" = "no",
                      "1"  = "yes"),
  )

Next, let’s visualise the variables with respect to alcohol consumption.

# access the tidyverse packages dplyr and ggplot2
library(dplyr); library(ggplot2)

g1 <- ggplot(data = alc, aes(x = high_use, fill = romantic.int))
g1 + geom_bar() + facet_wrap("romantic")

g2 <- ggplot(data = alc, aes(x = high_use, fill = internet.int))
g2 + geom_bar() + facet_wrap("internet")

g3 <- ggplot(data = alc, aes(x = high_use, fill = activities.int))
g3 + geom_bar() + facet_wrap("activities")

g4 <- ggplot(data = alc, aes(x = high_use, fill = freetime))
g4 + geom_bar() + facet_wrap("freetime")

Based on the barplots, it seems that students who are in a romantic relationship are less prone to high usage of alcohol. Internet, however, does not show a similar effect. Rather, it seems that students who have an internet connection at home consume more alcohol. This is probably explained by the fact that few households are without an internet connection.

library(dplyr)
alc %>% 
  group_by(internet.int) %>%
  summarise(percent = 100 * n() / nrow(alc))
## # A tibble: 2 × 2
##   internet.int percent
##   <fct>          <dbl>
## 1 0               15.4
## 2 1               84.6

As it seems, approximately 85% of the students have an internet connection at home.

Extracurricular activities seem to not have a large effect on alcohol consumption, but perhaps students who have no extracurricular activities might be slightly more prone to high alcohol consumption.

Lastly, it seems that the more free time students have, the smaller the gap between high and low alcohol consumption.

Next, I use logistic regression to analyse the variables further, and see whether my hypothesis really holds.

m <- glm(high_use ~ romantic.int + internet.int + activities.int + freetime, data = alc, family = "binomial")

summary(m)
## 
## Call:
## glm(formula = high_use ~ romantic.int + internet.int + activities.int + 
##     freetime, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2100  -0.8970  -0.7151   1.3130   1.9156  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.0503     0.5010  -4.093 4.26e-05 ***
## romantic.int1    -0.1777     0.2510  -0.708  0.47882    
## internet.int1     0.1793     0.3320   0.540  0.58925    
## activities.int1  -0.3529     0.2329  -1.515  0.12983    
## freetime          0.3895     0.1225   3.179  0.00148 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 439.01  on 365  degrees of freedom
## AIC: 449.01
## 
## Number of Fisher Scoring iterations: 4

Based on logistic regression, it seems that only freetime has a statistically significant association with high alcohol consumption.

library(broom)
m %>% 
  tidy(conf.int = TRUE, exp = TRUE)
## # A tibble: 5 × 7
##   term            estimate std.error statistic   p.value conf.low conf.high
##   <chr>              <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)        0.129     0.501    -4.09  0.0000426   0.0466     0.334
## 2 romantic.int1      0.837     0.251    -0.708 0.479       0.508      1.36 
## 3 internet.int1      1.20      0.332     0.540 0.589       0.635      2.35 
## 4 activities.int1    0.703     0.233    -1.51  0.130       0.444      1.11 
## 5 freetime           1.48      0.123     3.18  0.00148     1.17       1.89

Above, we have the coefficients (estimate) and the confidence intervals (conf.low and conf.high) for the different variables. Since the only statistically significant explanatory variable is the amount of free time, I mostly focus on that. The coefficient, approx. 1.48, indicates that moving one step up in freetime, which ranges from 1 (very low) to 5 (very high), multiplies the odds of high alcohol consumption by 1.48. In other words, an increase in freetime by one point increases the odds of high alcohol consumption by 48%. The negative coefficients (<1 in the tidy output, since the estimates are exponentiated) indicate that these explanatory variables (romantic relationship and extracurricular activities) have a decreasing effect on the odds of high alcohol consumption, i.e., if one is in a romantic relationship or has extracurricular activities, one is less prone to high alcohol consumption.
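The exponentiated estimates in the tidy() output come directly from the log-odds coefficients in the model summary; for instance:

```r
# Convert a log-odds coefficient (from the glm summary) to an odds ratio
beta_freetime <- 0.3895   # freetime estimate from the summary above
exp(beta_freetime)        # ~1.48, the odds ratio reported by tidy()
```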

m <- glm(high_use ~ freetime, data = alc, family = "binomial")
alc <- mutate(alc, probability = predict(m, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)

table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE
##    FALSE   259
##    TRUE    111

It seems that the model fitted only on freetime always predicts FALSE. We can calculate our model’s accuracy from the confusion matrix by:

\[ acc = \frac{(TP + TN)}{(P + N)} \]

For our model, that would result in \(acc = \frac{259}{370} = 0.7\)
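As a sketch, the same accuracy can be computed directly from the confusion-matrix counts above:

```r
# Counts from the table above: the model never predicts TRUE
tp <- 0; tn <- 259; fp <- 0; fn <- 111
acc <- (tp + tn) / (tp + tn + fp + fn)
round(acc, 2)  # 0.7
```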

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.3

Our loss function returns a loss of 0.3, indicating that, on average, we predict the wrong label for 30% of the observations.

library(dplyr)
alc %>% 
  group_by(high_use) %>%
  summarise(percent = 100 * n() / nrow(alc))
## # A tibble: 2 × 2
##   high_use percent
##   <lgl>      <dbl>
## 1 FALSE         70
## 2 TRUE          30

The data consists of 70% FALSE labels. Thus, if we always guessed the majority class, as our model actually does, our baseline accuracy would be 0.7. Any useful model should therefore obtain an accuracy higher than that.

Bonus exercise 1

m2 <- glm(high_use ~ sex + failures + absences + famrel + goout + health, data = alc, family = "binomial")
alc <- mutate(alc, probability = predict(m2, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2081081

I can get loss of approximately 0.2 in cross-validation by adding sex, failures, absences, famrel, goout, and health to the model features.


Clustering and classification

date()
## [1] "Thu Dec  1 10:19:01 2022"

This chapter focuses on a set of clustering and classification methods designed for visualizing and exploring statistical data. In general, we want to train a model that can position data points into different clusters (or groups) based on their characteristics. After the model is trained, it can be applied to unseen data to classify it into the learnt clusters. The most popular of these methods is k-means clustering.

Assignments

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data
data("Boston")

# explore the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The Boston dataset contains different explanatory variables related to housing values in the suburbs of Boston. The data consists of 506 observations and 14 variables. In practice, the variables are ones that can be assumed to impact housing values, e.g., crime rate, air quality, average number of rooms, median value of owner-occupied homes, etc. More information about the data can be found here.

# plot matrix of the variables
pairs(Boston)

The plot matrix gives a graphical overview of the data. However, this visualization is rather difficult to read; a correlation matrix provides a slightly clearer view.

library(corrplot)
## corrplot 0.92 loaded
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston)

# visualize the correlation matrix
corrplot(cor_matrix, method="circle")

From the correlation matrix, we can interpret the relationships between different variables. For instance, it seems that proportion of non-retail business acres per town, indicated by indus, has a rather high negative correlation with weighted mean of distances to five Boston employment centres, indicated by dis. Unsurprisingly, indus also seems to have a high (positive) correlation with nitrogen oxides concentration (parts per 10 million), indicated by nox. Median value of owned homes (medv) seems to correlate positively with number of rooms (rm), and negatively with lower status of the population (lstat).

What I think we can read from this data is, for instance, that in areas where the median value of owned homes (medv) is high, we see less crime, less non-retail businesses, better air quality, more rooms in houses, less pupils per teacher, etc, based on the correlations between the variables.

Summaries of the variables:

# summaries of the variables of the data
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Next, we standardize the variables with:

\[ \text{scaled}(x) = \frac{x - \text{mean}(x)}{\text{sd}(x)} \]

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
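As a quick sanity check, scale() applies exactly the formula above to each column. A minimal illustration on a single column (crim), using only the objects already defined:

```r
# verify that scale() matches the manual standardization formula
library(MASS)
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
auto <- scale(Boston)[, "crim"]
all.equal(unname(auto), manual)  # TRUE
```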

Next, we use the scaled crime rate to create bins, storing the crime rate as a categorical variable on a four-point scale from low to high. Furthermore, we drop the old continuous crime rate from the data and replace it with the quantile-based categories.

boston_scaled$crim <- as.numeric(boston_scaled$crim)

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim,
             breaks = bins,
             include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Before we can train a classifier (or apply any other ML technique), we should split the data into training and test sets. Here, we use 80% of the data for training and 20% for testing. For tuning the model, we might also want a development set, making the split 80-10-10, but for now we stick with 80-20. To the best of my understanding, from here on each row in the data becomes a 13-dimensional vector of predictors (the 14 variables minus the target), and the output predicted from that vector is the crime rate class (4 classes: low, med_low, med_high, high).

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

nrow(train); nrow(test);
## [1] 404
## [1] 102

Next, we fit the LDA model to the training data.

# LDA (lda() comes from the MASS package)
library(MASS)
lda.fit <- lda(crime ~ ., data = train)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

In the 2-dimensional visualization of the Linear Discriminant Analysis, we see that the crime classes are rather clearly separated: high crime rates lie in the right part of the space, med_high is positioned towards the bottom, and low is at the top left. med_low and med_high form a larger, somewhat overlapping cloud that is not clearly separated, likely because their values were also close to each other when we generated the crime rate bins. Additionally, the arrows show how the variables relate to the discriminants: the model associates the index of accessibility to radial highways (rad) quite strongly with a high crime rate, whereas the proportion of residential land zoned for lots over 25,000 sq.ft. (zn) is quite strongly associated with a low crime rate.

Let us then see how the LDA model generalizes to unseen data.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
predictions <- lda.pred$class
table(correct = correct_classes, predicted = predictions)
##           predicted
## correct    low med_low med_high high
##   low       12      11        1    0
##   med_low    6      17        4    0
##   med_high   0      11       19    0
##   high       0       0        0   21

The model does not seem very robust: it often predicts the wrong label, especially when the correct label is med_low. Then again, the middle classes are the more difficult examples to predict correctly.
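The overall accuracy can be computed from the cross tabulation. A small sketch rebuilding the printed table as a matrix (the counts are copied from the run above; re-knitting with a different random split would change them):

```r
# confusion matrix as printed above: rows = correct, columns = predicted
conf <- matrix(c(12, 11,  1,  0,
                  6, 17,  4,  0,
                  0, 11, 19,  0,
                  0,  0,  0, 21),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
# overall accuracy: correctly predicted observations lie on the diagonal
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)  # 0.676
```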

Finally, let us calculate the distances between the observations, and run k-means clustering.

library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# euclidean distance
dist <- dist(boston_scaled, method = "euclidean")

# look at the summary of the distances
summary(dist)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

K-means

set.seed(13)

# k-means clustering with 4 clusters (on the scaled data)
km <- kmeans(boston_scaled, centers = 4)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

The clusters are not very visible from such a small scatter plot matrix.

pairs(boston_scaled[6:10], col = km$cluster)

Here we notice that some of the clusters lie on top of each other, so the number of clusters is potentially too high. Let's try to find the optimal number of clusters.

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results (qplot() comes from ggplot2)
library(ggplot2)
qplot(x = 1:k_max, y = twcss, geom = 'line')

Above, the total within-cluster sum of squares (WCSS) provides us a way to find the optimal number of clusters. In practice, the optimal number is where the WCSS drops sharply and then levels off (the "elbow"). From the visualization, we can approximate that the optimal number of clusters is 2, because that is where the value changes the most.
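The elbow choice can also be made explicit programmatically: pick the k after which the relative WCSS drop is largest. A self-contained sketch (the name bs and nstart = 10, which stabilizes the fits across random starts, are my additions):

```r
# pick the elbow as the k following the largest relative WCSS drop
library(MASS)
set.seed(123)
bs <- scale(Boston)
wcss <- sapply(1:10, function(k) kmeans(bs, centers = k, nstart = 10)$tot.withinss)
# relative drop in WCSS when moving from k to k + 1
drops <- -diff(wcss) / wcss[-10]
elbow_k <- which.max(drops) + 1
elbow_k  # 2 with this seed
```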

# New k-means with 2 clusters
km <- kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with the 2 clusters
pairs(Boston, col = km$cluster)

pairs(boston_scaled[6:10], col = km$cluster)

Now the clusters are perhaps more visible, and, hopefully, in their own positions in the space.


Dimensionality reduction

This week's material focuses on a set of statistical methods that fall under the term dimensionality reduction. In practice, dimensionality reduction is a way to recognize and visualize the dimensions that carry the most information in multidimensional data by capturing as much of the variance of the original variables as possible. One of the best-known methods is Principal Component Analysis (PCA). After identifying the principal components through a few matrix transformations, we can observe the data in a 2-dimensional space, which would not be possible without dimensionality reduction.

Another method we focus on is Multiple Correspondence Analysis (MCA), which finds suitable transformations from categorical variables to continuous scales and then reduces the dimensions, analogously to PCA, for visualization purposes.

Task 5: Data wrangling

Done, and the assignment is in the latter part of this file.

Task 5: Analysis

library(dplyr)
human <- read.table("data/human.txt", sep = ",", header = T)

Now that we have wrangled the "human" data, we can start working on the analysis part. The data consist of 155 observations of 8 explanatory variables. These variables are designed to capture a broader picture of a country's development than economic growth alone. Thus, the data include variables related to gender inequality, life expectancy, education, et cetera. More information about the data can be obtained here. The figure below provides a graphical overview of the data.

library(GGally)
ggpairs(human)

summary(human)
##      edu.fm          labo.fm          edu.exp         life.exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       gni         maternal.mortality.ratio ado.birth.rate      parli.f     
##  Min.   :   581   Min.   :   1.0           Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5           1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0           Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1           Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0           3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0           Max.   :204.80   Max.   :57.50

From the above visualization, we can observe how the variables relate to each other in terms of the Pearson correlation coefficient (right of the diagonal). For instance, adolescent birth rate (ado.birth.rate) clearly correlates with the ratio of female to male education (edu.fm), as well as with expected years of education (edu.exp), life expectancy (life.exp), GNI, and maternal mortality (maternal.mortality.ratio).

From the scatter plots (left of the diagonal), we can observe some of these relationships directly. For example, expected years of education (edu.exp) seems to correlate positively with life expectancy (life.exp), and negatively with maternal mortality.

This information is perhaps easier to read from a heat map of the correlations, shown in the figure below. In the heat map, the darker the color, the stronger the correlation between the variables; blue indicates positive correlation and red negative. Looking at the variables discussed above, we notice that adolescent birth rate correlates rather strongly with the education ratio (edu.fm), and even more strongly with life expectancy, for instance. Expected years of education has a strong positive correlation with life expectancy and a strong negative correlation with maternal mortality. Thus, we can read that education goes together with higher life expectancy and lower maternal mortality.

library(corrplot)
cor(human) %>%
  corrplot(method="color")

Now that we have summarized and observed the data a little, we can move to dimensionality reduction.

PCA uses singular value decomposition (SVD) to find the principal components of the data. The first component explains the largest share of the variance, and the second component, which is orthogonal to the first, explains the largest share of the variance that remains after the first component has been removed. This way, we obtain two dimensions for plotting: principal components 1 and 2.
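This SVD connection can be verified on any numeric data. A quick, self-contained check on the built-in mtcars data (chosen purely for illustration): the principal component scores from prcomp() equal U D from the SVD of the centered and scaled matrix, up to column signs.

```r
# prcomp() vs. a manual SVD on centered, scaled data
X <- scale(as.matrix(mtcars))
p <- prcomp(X)
s <- svd(X)
scores_svd <- s$u %*% diag(s$d)  # principal component scores via SVD
all.equal(abs(unname(p$x)), abs(scores_svd))  # TRUE (signs may flip per column)
```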

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human, scale = FALSE)
pca_human_std <- prcomp(human, scale = TRUE)

sh <- summary(pca_human)
sh_std <- summary(pca_human_std)

pca_pr_sh <- round(100*sh$importance[2, ], digits = 1)
pca_pr_sh_std <- round(100*sh_std$importance[2, ], digits = 1)

pc_lab_sh <- paste0(names(pca_pr_sh), " (", pca_pr_sh, "%)")
pc_lab_sh_std <- paste0(names(pca_pr_sh_std), " (", pca_pr_sh_std, "%)")

# draw a biplot of the principal component representation and the original variables
biplot(pca_human,
       choices = 1:2, 
       cex = c(0.5, 0.5), 
       col = c("grey40", "deeppink2"),
       xlab = pc_lab_sh[1], ylab = pc_lab_sh[2]
       )
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

biplot(pca_human_std, 
       choices = 1:2,
       cex = c(0.5, 0.5),
       col = c("grey40", "deeppink2"),
       xlab = pc_lab_sh_std[1], ylab = pc_lab_sh_std[2],
       )

From the visualizations above, we can see why it is necessary to standardize the data before applying PCA. Because PCA maximizes variance, with non-standardized data a single variable on a large scale can appear to dictate the variance, even though other variables also contribute (first figure). After standardizing the data, the model can actually find the components that maximize the variance (second figure). This is also reflected in the PCA summaries below: the proportions of variance indicate that the first principal component explains practically all of the variance in the non-standardized data, whereas with the standardized data the first principal component explains slightly more than half of the variance and the second approximately 16%.

sh; sh_std
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000

We can also attempt to interpret the principal components. I focus on the PCA of the standardized data (second figure above). Life expectancy and expected years of education (as well as the female-to-male education ratio) increase towards the countries clustered on the left of the space. Similarly, the number of female representatives in parliament increases towards the top left corner, which is occupied mostly by Scandinavian countries. Many countries on the left side are European or East Asian, along with a couple of rich countries from the Arabian peninsula. The countries clustered at the right end of the space are mostly from the African continent; there, maternal mortality seems to be high, as does the adolescent birth rate, two phenomena that probably go hand in hand.

Finally, my personal interpretation of the first two principal components of the standardized data: the first principal component seems to represent life expectancy and the variables associated with it, while the second seems to roughly represent gender inequality.

Tea data

In the final part of this exercise, we investigate a dataset that consists of answers to a questionnaire about tea consumption.

tea <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", 
                  sep = ",", header = T)
str(tea); dim(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : chr  "breakfast" "breakfast" "Not.breakfast" "Not.breakfast" ...
##  $ tea.time        : chr  "Not.tea time" "Not.tea time" "tea time" "Not.tea time" ...
##  $ evening         : chr  "Not.evening" "Not.evening" "evening" "Not.evening" ...
##  $ lunch           : chr  "Not.lunch" "Not.lunch" "Not.lunch" "Not.lunch" ...
##  $ dinner          : chr  "Not.dinner" "Not.dinner" "dinner" "dinner" ...
##  $ always          : chr  "Not.always" "Not.always" "Not.always" "Not.always" ...
##  $ home            : chr  "home" "home" "home" "home" ...
##  $ work            : chr  "Not.work" "Not.work" "work" "Not.work" ...
##  $ tearoom         : chr  "Not.tearoom" "Not.tearoom" "Not.tearoom" "Not.tearoom" ...
##  $ friends         : chr  "Not.friends" "Not.friends" "friends" "Not.friends" ...
##  $ resto           : chr  "Not.resto" "Not.resto" "resto" "Not.resto" ...
##  $ pub             : chr  "Not.pub" "Not.pub" "Not.pub" "Not.pub" ...
##  $ Tea             : chr  "black" "black" "Earl Grey" "Earl Grey" ...
##  $ How             : chr  "alone" "milk" "alone" "alone" ...
##  $ sugar           : chr  "sugar" "No.sugar" "No.sugar" "sugar" ...
##  $ how             : chr  "tea bag" "tea bag" "tea bag" "tea bag" ...
##  $ where           : chr  "chain store" "chain store" "chain store" "chain store" ...
##  $ price           : chr  "p_unknown" "p_variable" "p_variable" "p_variable" ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : chr  "M" "F" "F" "M" ...
##  $ SPC             : chr  "middle" "middle" "other worker" "student" ...
##  $ Sport           : chr  "sportsman" "sportsman" "sportsman" "Not.sportsman" ...
##  $ age_Q           : chr  "35-44" "45-59" "45-59" "15-24" ...
##  $ frequency       : chr  "1/day" "1/day" "+2/day" "1/day" ...
##  $ escape.exoticism: chr  "Not.escape-exoticism" "escape-exoticism" "Not.escape-exoticism" "escape-exoticism" ...
##  $ spirituality    : chr  "Not.spirituality" "Not.spirituality" "Not.spirituality" "spirituality" ...
##  $ healthy         : chr  "healthy" "healthy" "healthy" "healthy" ...
##  $ diuretic        : chr  "Not.diuretic" "diuretic" "diuretic" "Not.diuretic" ...
##  $ friendliness    : chr  "Not.friendliness" "Not.friendliness" "friendliness" "Not.friendliness" ...
##  $ iron.absorption : chr  "Not.iron absorption" "Not.iron absorption" "Not.iron absorption" "Not.iron absorption" ...
##  $ feminine        : chr  "Not.feminine" "Not.feminine" "Not.feminine" "Not.feminine" ...
##  $ sophisticated   : chr  "Not.sophisticated" "Not.sophisticated" "Not.sophisticated" "sophisticated" ...
##  $ slimming        : chr  "No.slimming" "No.slimming" "No.slimming" "No.slimming" ...
##  $ exciting        : chr  "No.exciting" "exciting" "No.exciting" "No.exciting" ...
##  $ relaxing        : chr  "No.relaxing" "No.relaxing" "relaxing" "relaxing" ...
##  $ effect.on.health: chr  "No.effect on health" "No.effect on health" "No.effect on health" "No.effect on health" ...
## [1] 300  36

Nearly all the variables are categorical; only age is numeric (and it is also available binned as age_Q). For the MCA, we convert all of the columns to factors.

col_names <- names(tea)
tea[,col_names] <- lapply(tea[,col_names] , factor)

str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : Factor w/ 61 levels "15","17","18",..: 24 30 32 8 33 6 22 21 25 22 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
library(dplyr)
library(tidyr)
library(ggplot2)

vis <- pivot_longer(tea, cols = everything()) %>% 
  ggplot(aes(value)) + 
  facet_wrap("name", scales = "free") + 
  geom_bar() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 0))

vis

I could not fit the visualization of all the variables on this page in a way that keeps the bar labels readable. Thus, I select a few of the variables and plot them again with labels. These selected variables are also used for the MCA below.

library(dplyr)
library(tidyr)
library(ggplot2)

# column names to keep in the dataset
keep <- c("Tea", "How", "how", "sugar", "where", "lunch", "breakfast")

# select the 'keep' columns to create a new dataset
tea_time <- dplyr::select(tea, all_of(keep))

# visualize the dataset
vis_ <- pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + 
  facet_wrap("name", scales = "free") + 
  geom_bar() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

vis_

library(FactoMineR)

mca <- MCA(tea_time, graph = FALSE)

# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")

In the visualization above, the colors correspond to the different variables in the data: black indicates the type of tea, pink where the tea was purchased, green how it is packaged, red how it is enjoyed (with milk, lemon, other, or alone), brown whether it is consumed at lunch, and grey whether it is enjoyed at breakfast. From the MCA results we can observe, for instance, that green tea is typically bought unpackaged from a tea shop, whereas Earl Grey and black tea are more typically bought as tea bags from a chain store. Earl Grey tends to be enjoyed with milk and sugar, whereas black tea goes with lemon. It seems that Earl Grey is a good breakfast tea, especially with milk, whereas green tea taken alone may be better suited outside lunch and breakfast.


